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d • Time series of cell size evolution in unicellular marine algae (di- 

vision Haptophyta; Coccolithus lineage), covering 57 million years, 
QQ , are studied by a system of linear stochastic differential equations of 

hierarchical structure. The data consists of size measurements of fos- 
silized calcite platelets (coccoliths) that cover the living cell, found 
in deep-sea sediment cores from six sites in the world oceans and 
' dated to irregular points in time. To accommodate biological theory 

of populations tracking their fitness optima, and to allow potentially 

■ interpretable correlations in time and space, the model framework al- 
C/3 ' lows for an upper layer of partially observed site-specific population 

means, a layer of site-specific theoretical fitness optima and a bottom 
layer representing environmental and ecological processes. While the 
| modeled process has many components, it is Gaussian and analyti- 

■ cally tractable. A total of 710 model specifications within this frame- 
' work are compared and inference is drawn with respect to model 

structure, evolutionary speed and the effect of global temperature. 

1. Introduction. Biological populations evolve with respect to the dis- 
. tribution of organism size and other phenotypic traits by differential fitness. 

How a phenotypic character like body size evolves over time and what en- 
vironmental factors influence phenotypic change are fundamental questions 
of biology and paleontology. Some key issues include the following: Is the 
average size fluctuating around a fixed or a changing fitness optimum [Estes 
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and Arnold (2007)]? How fast does the population mean track the optimum? 
What is the inertia of the fitness optimum, and is it tracking a deeper pro- 
cess representing environmental conditions? If there is a deeper process, what 
qualities does it have? Are population-specific optima related, for example, 
by influence from an unobservable underlying global climatic process? Al- 
ternatively, are the phenotypic means at different sites correlated directly? 
Is the optimum responding to global temperature? 

Several recent studies have highlighted a covariance between the body 
size evolution of marine organisms and global temperature. The long-term 
cooling trend over the past 65 million years appears to be matched by 
a macroevolutionary size decrease in marine phytoplankton [Finkel et al. 
(2007)] while marine zooplankton and benthos show size increases [see Schmidt 
et al. (2004), Hunt et al. (2010)]. With the exception of Hunt et al. (2010), 
which only used random walks as a time series model, these studies did, how- 
ever, not take time dependency into account when testing this hypothesis. 
Without consideration of the internal dynamics of two trended time series, 
the null hypothesis of no interaction between the series may be wrongfully 
rejected. Here, we study the evolution of body size in marine algae by means 
of layered stochastic differential equation (SDE) models. 

In unicellular algae, cell size and its geometry largely determine the trans- 
port rates of dissolved components (e.g., CO2/O2, nutrients) into and out of 
the cell, which is fundamental to photosynthesis and growth rates. We study 
the evolution in average log-phenotype of calcifying microalgae belonging 
to the Coccolithus Schwartz 1894 lineage. The Coccolithus lineage (division 
Haptophyta) has extant species in today's oceans and a well-documented fos- 
sil record dating back to the early Paleocene [see Haq and Lohmann (1976)]. 
We study coccolith size, which is measured by the largest diameter in the 
elliptically shaped calcite platelets (coccoliths) that cover the unicellular or- 
ganism. Coccolith size is used as a proxy for cell size [see Henderiks (2008)] . 
A total of 205 deep-sea sediment samples, taken from 6 different sites in the 
Atlantic, Indian and Pacific oceans, offer a final data set of 19,899 individual 
size measurements distributed over the last 57 million years (My) (Figure 1). 
To current standards of palaeontology and evolutionary biology, these data 
are unusually extensive. 

The population of algae behind sampled coccoliths from one site is as- 
sumed to track its fitness optimum by natural selection. Fitness (expected 
number of reproducing offspring) is distributed over the individuals of the 
population, with the optimum varying over time according to physical and 
ecological conditions. The fitness optima of different populations may be 
spatially correlated, due to common environmental conditions, in addition 
to being temporally correlated within a population. What affects fitness with 
respect to coccolith size is unknown. Global mean ocean surface tempera- 
ture might be a contributor, and a measured temperature indicator series 



LAYERED SDES 



3 



2.5 




i 1 — 1 — 1 — |— r - i — i — 1 i — 1 — 1—1 — i — 1 — 1—1 — 1 — i — 1—1 — 1 — 1 i 
-60 -50 -40 -30 -20 -10 

time (million years) 



Fig. 1. DSDP and ODP site locations (inset map) and site-specific mean logarithmic 
size values spanning the last 57 My. One symbol is shown for each of the 205 observations 
of mean log coccolith size in the time series plot. 

[Zachos et al. (2001)] is tested in the model framework as potentially driving 
the fitness optimum processes along with other underlying but unmeasured 
processes. 

In this study, we will work on the logarithmic scale, as untransformed coc- 
colith sizes cannot be normally distributed. Mean logarithmic coccolith size 
for each population is seen as a stochastic process pulled toward an optimal 
state, which itself is seen as a stochastic process subject to pull from addi- 
tional underlying processes. These underlying processes might be population 
specific and possibly correlated across populations. The model framework 
thus allows for three hierarchical layers of processes, each layer having one 
process for each geographical location. In the building of our framework, the 
three layers are the population means, population fitness optima and under- 
lying environmental processes affecting fitness such as an observed global 
temperature indicator series and also unmeasured environmental variables. 

As the data are irregularly distributed in time, a stochastic time series 
framework that can handle continuous time is called for. Linear SDE models 
constitute a parsimonious framework for such modeling. Vector processes 
governed by linear SDEs, as defined in Section 3, are suitable for hierarchical 
models with variables ordered by the flow of causality as determined by the 
coefficients in the equations; see Schweder (2012). Linear vectorial SDEs are 
analytically tractable, having an explicit likelihood. 

A collection of 710 variants of models within this framework were fitted to 
the sample means at the six locations. These model variants are compared 
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and properties of the evolutionary process are discussed by way of common- 
alities in models which are singled out by various statistical criteria. We use 
both Bayesian and classic methods. 

Lande (1976) was the first to suggest that the population mean of a phe- 
notypic trait, for example, logarithmic coccolith size, might evolve like an 
Ornstein-Uhlenbeck (OU) process. This was taken further by Estes and 
Arnold (2007). The Lande model [Lande (1976)] for the evolution of the 
mean phenotype, such as logarithmic size, is governed by the one-dimensional 
linear SDE, 

(1) dX(t) = -a(X(t) - fi )dt + adW{t), 

where dW(t) is white noise and t is time. This is an OU process when 
a > 0. The mean phenotype X is pulled toward the level //o with a force 
proportional to its displacement X(t) — fio both when having been pushed 
by randomness above or below its long-term level hq. This level is the fitness 
optimum in Lande (1976). We will use the term pull for the parameter a 
and diffusion for a. Phenotypic evolution is also modeled by OU processes in 
Hansen (1997), though in a phylogenetic rather than time series perspective, 
and so do Hansen, Pienaar and Orzack (2008), where the level process /x(t) 
is defined to be an underlying random walk. A random walk model, which 
can be seen as a limiting case of equation (1) with a = 0, has been studied 
by Hunt (2006) and Hunt, Bell and Travis (2008). Random walks have been 
in use for a long time, as a proposed null hypothesis in evolutionary models; 
see, for instance, Raup (1977). SDE models have been around for years, and 
have been used in biology [see Allen (2003), Chapter 8] and physics [see 
Schuss (1980), Chapter 2]. 

Our aim is to develop a class of models that is capable of describing 
the data reasonably well, that allows biological interpretations and that is 
statistically feasible. These models might also be useful elsewhere, especially 
for time series with irregular temporal resolution. 

In Section 3 we will first introduce the linear SDE framework tailored for 
our application. We will then briefly discuss issues of causality and hierarchy. 
Model selection and inference is described in Section 4. These methods are 
then applied first to artificial data in Section 5 and then to the coccolith 
data in Section 6. Some concluding remarks are offered in Section 7. 

2. The coccolith data. Microfossils (coccoliths) were measured in a total 
of 205 sediment samples obtained from Deep Sea Drilling Project (DSDP) 
and Ocean Drilling Program (ODP) deep-sea sediment cores, taken at six 
sites in the Atlantic, Indian and Pacific oceans, altogether spanning about 
57 My; see Figure 1. 

In each sample, 1 to 400 individual coccoliths were measured on slides us- 
ing polarized light microscopy [see Henderiks and Tdrner (2006), Henderiks 
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(2008)], resulting in a mean sample size of 19,899/205 = 97. The data was 
transformed from original size in fj,m to the logarithm of that, before means 
and variances were calculated. The age of each sample is estimated from 
biostratigraphic data calibrated to the geological time scale of Cande and 
Kent (1995). 

While the coccoliths within a single sample may have been formed as much 
as one thousand years apart, we consider them simultaneously formed on 
our geological time scale. The sampling process is assumed random, from the 
historical population through deposition fossilization, drilling and extraction 
from the drill cores. 

The stationary distribution of mean log size is normal in our model. Simi- 
larly, any lack of normality in the samples should give little cause for concern 
when modeling phenotypic means by sample means, due to the central limit 
theorem. In the tradition following Lande (1976), we study the evolution of 
the population mean of the phenotype in question, which in our case is the 
logarithmic body size. Population variance and other population processes 
and parameters are of potential interest, but for our study these are nuisance 
parameters. To keep the complexity of our model within bounds dictated 
by the data, the population variances are estimated outside the SDE frame- 
work and smoothed by a simple GAM analysis so as to reduce the variance 
uncertainty for samples with a low number of measurements. 

We assume the sample means to be normally distributed with known 
variances; see equation (2) in the next section. The 19,899 separate mea- 
surements are thus summarized by the 205 sample means. The assumption 
of normality and known variances is not strictly true, but not far off, and it 
allows the likelihood to be computed. 

To diagnostically check the assumption that the sample means of log 
coccolith size are normally distributed, skewness and kurtosis in the distri- 
butions of the sample means were estimated by bootstrapping for the 192 
samples with more than 4 observations, and tested. The p-value plots [see 
Schweder and Spj0tvoll (1982)] show only small departures from normality, 
with some sample means having a bit of positive skewness, but with no trace 
of kurtosis. See the supplementary information for more on the normality of 
the data, the treatment of the measurement variances and other data issues. 

3. Layered linear SDE models. 

3.1. The modeling framework. If we concentrate on a single population, 
the idea is to connect measurements from different times together using a 
continuous time series model. Thus, the irregular time intervals between 
observations will not pose a problem. 

As described in the Introduction, we aim at describing the phenotypic 
mean as a process responding to another continuous time process which 



G 



T. REITAN, T. SCHWEDER AND J. HENDERIKS 



is hidden, namely, the size at optimal fitness, which may again respond to 
environmental changes, partially or entirely unmeasured. Thus, the equation 
describing the dynamics in phenotypic mean (layer one) will not only contain 
the phenotypic mean, but also a hidden underlying process representing 
optimal fitness (layer two). Layers are in general processes defined so that 
one layer can respond to the current state in another layer, which is then 
said to be below it. The layer may then in its turn affect the layer above it, 
unless it is the topmost layer. 

In our framework, layer two can again be affected by a third layer, under- 
stood as the unmeasured environmental conditions. The second layer might 
also be forced by an external global temperature indicator time series on 
this layer, as measured by Zachos et al. (2001). 

In an SDE, the change in continuous time series from one time point to 
another time point infinitesimally further along is modeled by a transforma- 
tion of the previous state plus some normally distributed noise. When this 
transformation is linear and the observational noise is assumed normal and 
not conditioned on the state, we call this a (vectorial) linear SDE. When 
one has a linear SDE system and the state at the starting point of the pro- 
cess is normally distributed, then both the marginal distribution and the 
conditional distribution of the state of a later time point is also normally 
distributed. When also the sampling errors are normal, a likelihood can be 
calculated. These conditions provide the modeling framework for this study. 

A sample mean Yj for time t is thus considered a noisy representative 
of the state of the topmost layer X\(t), namely, the phenotypic population 
mean at the same time point. So 

(2) Yt-NiX^^l/nt), 

where si is the sample variance and is the sample size. The sampling er- 
rors Yt — X±(t) are assumed independent. In the language of hidden Markov 
models, equation (2) will be the observation equation while the system equa- 
tion will be described in the next subsections. Since the state process is a 
normally distributed linear hidden Markov chain and the observations are 
normal and independent given the state process, the Kalman filter can be 
applied for calculating the likelihood; see the supplementary material [Rei- 
tan, Schweder and Henderiks (2012)]. 

As there are six geographic sites, instead of a single process per layer, 
we have six components in each layer. To allow for possible instantaneous 
commonalities in a layer, there might be instantaneous correlations across 
sites in the layer. We allow only correlations between processes belonging to 
the same layer. 

Collected over the 6 sites, the state of the system is situated in an 18- 
dimensional vector space and evolves according to a linear hierarchical sys- 
tem of SDEs. 
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3.2. Linear SDE basics and examples. The OU process, described in 
equation (1), is a simple linear SDE process. It is stationary when a > 0. 
The stationary distribution is normal with expectation E(X(t)) = /j,q and 
standard deviation n = sd(X(t)) = a/y2a. This can be verified by equa- 
tion (5) later in the text. The stationary correlation is corr(X(ii), Xfo)) = 
e — a(*2— *i) _ e — (*2— ti)/At^ ^ < t 2 , where At = 1/a gives a characteristic cor- 
relation time for the process, the time for the correlation to drop to 1/e. 
Such alternate parametrizations, using, for instance, At and r/ instead of 
the pull a and the diffusion a, can make the interpretation of results easier, 
as well as help in the elicitation of Bayesian priors. Thus, with three param- 
eters, one can parsimoniously describe a continuous function of time that is 
a stationary process. 

When the level term fj,Q in equation (1) is replaced by a process X2(t) of 
the OU type, one gets a coupled linear SDE: 



where X\ is the topmost, partially measured process layer. It is driven by X2, 
which is a hidden process. Since X2 is unaffected by X\, it is an OU process 
on its own. X\ is, however, dynamically affected by X2 and has a different 
time correlation function from that of an OU process, as we will show later. 
This causal structure is described by X2 — > X\ . When this causality takes 
the form found in equation (3), X\ is said to be tracking X2. 

The joint process of X\ and X2 stacked in a vector is a special case of the 
following vectorial linear SDE, 



where the state vector X(t) and m(t) are p-dimensional, A is called the pull 
matrix and is of dimension p x p, m{t) is regarded a deterministic process, 
£ is a p x q dimensional diffusion matrix and W is a g-dimensional Wiener 
process. The structure of direct causal relations between components of X 
is determined by the nonzero elements in the pull matrix A. If Aij 7^ 0, then 
component Xj directly affects X;, and there is an arrow from Xj to Xi in 
the causality graph. The notion of causality used here is equivalent to that 
of Granger (1969) in Markov process models; see Schweder (2012). Since the 
contributions in equation (4) are normal and linear, the distribution will be 
normal and thus characterized by the expectation and covariance: 



dXi(t) 



ai(Xi(i) - X 2 (t)) dt + a x dWx{t) 



(3) 



dX 2 (t) 



a 2 (X 2 {t) - fi ) dt + a 2 dW 2 (t) 



(4) 



dX{t) = {m{t) + AX(t)) dt + Z dW{t) 



(5) 



EX(t) 



U- 1 e A( '-' o) UX(t ) + U 



-1 




e A{t - u) Vm{u)du 




V 



-1 
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for v < t, where V is the left eigenvector matrix and A is the diagonal matrix 
of eigenvalues of A, so that A = V~ 1 AV conditioned on the existence of such 
a representation. A layered causal structure is achieved by specifying which 
of the elements in A are nonzero and by restricting the diffusion matrix to 
a blocked structure. See the supplementary material [Reitan, Schweder and 
Henderiks (2012)] and Schweder (2012) for more on this. 

Since the covariance structure of X(t) governed by equation (4) is avail- 
able in a closed form in equation (5), the covariance of the topmost compo- 
nent of the process of equation (3) can be found to be 

(6) cov(X l( 0), Xl (t)) = ^e--< + (^~ a2t ~ ^~ ait ) - 

provided a\ ^ 012 are both positive. The covariance structure here is different 
from the one-layered OU process described in equation (1). This indicates 
that data generated from the top layer of a two-layered process will be 
detectably different from that of the one-layered case. 

The model we initially made for the coccolith size data had three layers, 
phenotypic mean, optimum and climate. We assumed the stochastic contri- 
butions in the second layer to be identical over the sites, while the stochastic 
contributions in the two other layers were uncorrelated over the sites. We 
have k = 6 processes indexed by j representing each of the sites in each of 
the 3 layers, denoted by X\ d in the top layer, X 2d in the middle layer and 
X 3d at the bottom. The flow of causality is 

(7) X 3J ^ X 2:j ^ X ltj . 

So, for a given site j G {1, . . . , 6}, we have the following SDE system: 
dX ld {t) = -an(X ltj (t) - X 2d {t)) dt + <j\ dWi d (t), 

(8) dX 2>j (t) = -a 2 (X 2>j (t) - X 3;j (t)-pT(t))dt + a 2 dW 2 (t) , 

dX 3d (t) = -a 3 (X 3d (t) - no) dt + c7 3 dW 3d (t), 

where the W processes are independent Wiener processes. The regressor 
term f3T(t) defines how the second layer responds to an exogenous process 
T(t), which in our application will be a global temperature indicator series, 
as measured by Zachos et al. (2001). Thus, m(t) in equation (4) will have a 
contribution a 3 fio on the third layer and a contribution a 2 (3T(t) on the sec- 
ond. The second layer thus tracks a linear combination of the third layer and 
the external time series T(t). While there are 18 processes in this particu- 
lar model, it has only 8 parameters, namely, 9 = (cti, ol 2 i ol 3 , <ti, 02, 03, f3, fJ>o)- 
The processes are regional (though inter-regional instantaneous correlation 
is introduced in layer two) but the parameters are global and thus so is 
the nature of the dynamics. The covariance function on the top (pheno- 
typic) layer for this model is shown in the supplementary material [Reitan, 
Schweder and Henderiks (2012)]. 
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3.3. Pull identifiability. It is shown in the supplementary material [Rei- 
tan, Schweder and Henderiks (2012)] that in any multi-layered process hav- 
ing only one pull parameter per layer, the pulls can be reshuffled so that if 
there are I layers, then a/ < cty_i < ■ • ■ < a2 < a±. It thus initially seemed 
natural to restrict the inference outcomes so that this is the case, which will 
be called the pull identification restriction. However, in the case of multiple 
sites, requiring, for example, that 03 < «2, s < a\ for all sites s will exclude 
some covariance structures. When one analyzes data using the model frame- 
work without imposing the pull identification restriction, one does, however, 
need to keep in mind that a reshuffling of the pull parameters is possible 
and may in some cases even be necessary. The alternative is either to impose 
the restriction only on one specific site or to impose them on all and so risk 
overlooking valid solutions. As we wished to avoid singling out a specific site 
and did not want to risk being overtly restrictive, we did not enforce the 
pull identification restriction. The identification issue should, however, be 
kept in mind when interpreting the results. In the supplementary material 
[Reitan, Schweder and Henderiks (2012)] we present results also under the 
pull the identification restrictions. 

When initially testing the simulation study, we used the pull identifica- 
tion restriction. The simulation analysis suggested an error in our treatment 
of the combination of prior distribution and the identification restriction, 
which resulted in Bayesian model selection support for over-complex mod- 
els. For one-layered models the prior distribution for the pull parameter 
remains undisturbed. However, when imposing the pull identification restric- 
tion on our prior distribution for multi-layered models, the marginal distri- 
bution of each pull parameter became narrower. Thus, if the data suggests a 
pull parameter within this narrower interval, this is better predicted by the 
multi-layered model which is then unfairly supported in a Bayesian model 
likelihood comparison. We thus corrected our prior distribution so that the 
marginal distribution for at least one common pull parameter remains the 
same for models having different numbers of layers. See the supplementary 
material [Reitan, Schweder and Henderiks (2012)] for more on this issue and 
for the simulation results after correcting the prior distribution. 

4. Likelihood-based inference methodologies. 

4.1. Single model inference. A process governed by equation (4) is Gaus- 
sian with known mean and covariance structure. The measurement errors 
then add extra variance (equal to the sample variance divided by the sample 
size), which is also regarded as known. The likelihood of observations of the 
process affected by independent normal measurement errors is thus avail- 
able in closed form. The likelihood function is, however, often multimodal 
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since it usually is a nonlinear restriction of the covariance matrix of the ob- 
servations; see, for example, Sundberg (2010). Our experience is also that 
the coccolith data lead to multimodal likelihood functions for the models we 
consider. For maximum likelihood (ML) estimation a shotgun approach with 
a hill-climbing algorithm from some 50 starting points widely distributed in 
parameter space seemed to work reasonably well, as long as the number of 
parameters were kept low. Stable and efficient ML results were, however, 
only gotten when the hill-climbing algorithm was initialized using posterior 
Bayesian Markov chain Monte Carlo (MCMC) samples. 

In a Bayesian setting using MCMC sampling, the multimodality issue 
seems to have been dealt with and stable, reproducible results were pro- 
duced. 

Due to the nonlinear parameters and the relatively low number of mea- 
surements, the Hessian at the likelihood maximum cannot be expected to 
provide reliable measures of estimation uncertainty. Bootstrapping might 
work, but is computationally expensive. Bayesian MCMC samples seem, 
however, to work and do yield approximate credibility intervals and scat- 
ter plots showing the dependency in the inference on different parameter 
combinations. 

For the application to the fossil coccolith data, to be discussed next, 
we thus used Bayesian methods both for model choice and for parameter 
estimation, though we also used Akaike's Information Criterion (AIC) in 
the model selection and included ML estimates in the final presentation of 
models. 

Further description of the Kalman filter, numerical ML optimization, 
Bayesian MCMC techniques for single model inference, numerical problems 
and computational efficiency are provided in the supplementary material 
[Reitan, Schweder and Henderiks (2012)]. 

4.2. Model variants. Our model frame for the coccolith data is an 18- 
dimensional version of equation (4) with 3 hierarchical layers with 6 com- 
ponents each. The diffusion matrix £ is block diagonal accordingly, with Sj 
as the block for layer i. We also consider similar models having one or two 
hierarchical layers. For each layer i, we consider the following variants: 

(1) Regionality: It could be that one or several parameters (expectation, 
layer-specific pull or diffusion) are regional, that is, different for different 
sites. Regionality is, however, only allowed in one of the three layers in a 
model variant. For more on this, see the supplementary material [Reitan, 
Schweder and Henderiks (2012)]. 

(2) Determinism: A layer will be deterministic if it receives no stochastic 
contributions, that is, if S, = 0. Note that the pull and characteristic time 
may still be finite, so that the layer performs a deterministic filtering of the 
underlying stochastic layer. 
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(3) Random walk: This is only possible for the lowest layer (interpreted 
as the climatic layer in this application). With a random walk layer, the 
process is nonstationary. Approximate random walk is achieved by setting 
the pull very low. Ideally, the pull should be set to zero, but for practical 
reasons, we chose to implement a, = 0.001, which means Aij = 1 Gy. 

(4) Regional correlation: The components in a layer might be instan- 
taneously correlated. We allow only one correlation coefficient, Sj Jlj2 = 
Pi cr i,ji cr i > j2 f° r an y two sites ji j2- With p% = l, the instantaneous stochas- 
ticity [associated with dW(t) in equation (4)] in the layer collapses to one 
dimension, as for the second layer in equation (8). 

By crossing model variants and by varying restrictions on the parameters, 
we end up with 710 different models within our model framework. Without 
restrictions to only one layer with respect to regionality, the number of 
model variants to consider would much exceed 710, which already is a big 
number. Methods for exploring properties of the evolution of Coccolithus by 
confronting these models with the data in a Bayesian way is discussed next. 

4.3. Model and property comparison. In order to calculate Bayesian model 
probabilities, one needs the marginal data likelihood, 



where D is the data, M is the model, 9 is the vector of parameters and ir is 
the Bayesian prior distribution of the parameters. Since this is conditioned 
on the model, we will call it the Bayesian model likelihood (BML). From 
this and the prior model probability Pr(M) = 1/710, the posterior model 
probability Pr(M\D) is calculated by Bayes' theorem. Since the prior model 
probabilities are assumed equal, Pr(M\D) is proportional to the marginal 
likelihood, f(D\M). Comparing models by their posterior probabilities, one 
should note that the posterior probabilities cannot be interpreted in absolute 
terms. The problem here is that there might be many models that are very 
similar to each other, and the posterior probability of this type of model will 
be diluted by the model multiplicity. 

The Bayes factor of two competing models is defined by how much the 
relative probability for the two models change from the prior to the pos- 



terior case, £i,2 = (Pr(Mi|D)/Pr(M 2 /D))/(Pr(Mi)/Pr(M 2 )) = f{D\M{)/ 



f(D\M2). Thus, it is equal to the marginal likelihood ratio. The Bayes factor 
describes the amount of evidence in the data for model 1 versus model 2. 

In equation (9) the parameter 9 is the same for all model specifications. 
These models differ, however, with respect to restrictions imposed on the 
parameter, and the prior distribution is modified accordingly. We use a very 
wide prior distribution; see below. Due to the complexity of the prior and 
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nonlinearities in the model, the marginal data likelihood is not analytically 
available despite the likelihood function being Gaussian. The numerical 
method we use is an importance sampler described in the supplementary 
material [Reitan, Schweder and Henderiks (2012)]. 

Since we look at many models with different properties, some such prop- 
erties might be evaluated by expressing the different categories of a given 
property and identifying the models within each category. Looking at such a 
multitude of models means that many combinations of properties are tested 
at once. A model may then sometimes get support simply by statistical fluc- 
tuations. It will therefore be beneficial to look for a few general model prop- 
erties, one at a time, and combine those, when looking for the best model. In 
order to assess what the data indicate concerning different properties, each 
category can be given the same property-specific prior probability, which 
we will call a weight for the application of studying model properties. Each 
model within that property can be given the same prior weight, so as to bet- 
ter illustrate in which direction the evidence pulls and how much evidence 
there is. 

If each property was equally probable and each model within each cate- 
gory also was equally probable, the weights would be Bayesian probabilities. 
We are, however, not assuming that this is reasonable. Instead, we simply 
use the formalism of Bayesian probability theory, with probabilities relabeled 
as weights, in order to explore how reasonable a given property category is. 
Note that the ratio of one weight to another will form the traditional Bayes 
factor for a property under the assumption of equal model probabilities 
within each property. If different categorizations are made, this will result 
in different prior weights for each model, so it seems more appropriate to 
call both the input and the results for weights rather than probabilities. 

A categorization allows for an evaluation of the evidence for a given cat- 
egory of a property by weighting it with the number of models it contains. 
This is done by evaluating the posterior weights 



based on the prior weights — , where C is a category, iiq is the number of 
models within the category, and C runs over the categories of the property. 
Thus, this method of comparison compensates for the difference in number 
of models with different properties. 

5. Simulations. In order to test the model comparison, we made artifi- 
cial data sets based on models with one, two and three layers. The models 
used for generating the data sets were the best one-, two- and three-layered 
model according to BML applied on the original data set while using the ML 
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Table 1 

Number of correct identifications of a model ( as given by the 
number of layers) according to different selection schemes, 
when a total of 20 data sets are simulated; see text 



Actual model 


AIC 


BIC 


BML 


One layer 


16 


20 


15 


Two layers 


20 


18 


12 


Three layers 


19 


15 


19 



estimates for the choice of the parameter set. See the supplementary mate- 
rial [Reitan, Schweder and Henderiks (2012)] for a description of these three 
models. Twenty data sets were then simulated for each of these three cases, 
with the same amount of data situated at the same measurement times and 
having the same measurement uncertainty as the real data, but where the 
state processes themselves were simulated using these models. 

The same three models were then used for analyzing each of these three 
types of data sets and to see how often different statistical model selection 
methods, AIC, BIC and BML, resulted in the right number of inferred layers. 
The results are shown in Table 1. BIC seems to perform best when a one- 
layered model has produced the data, but underperforms for data produced 
by a three-layered model. In total, it performs slightly worse than AIC. 
Fisher's exact test for contingency tables suggest that there is no significant 
difference in performance between AIC and BML for data produced by a 
one- and three-layered model, but that BML underperforms in relation to 
AIC for data produced by a two-layered model. 

It should, however, be noted that these observations only hold for our 
particular choice of one-, two- and three-layered models and is conditioned 
on the parameter values behind the simulated data. We used the maximum 
likelihood estimates. Since Bayesian methods allow for property inference, 
inclusion of prior knowledge and measures of parameter uncertainty, we 
will use both AIC and BML in the analysis. It should also be noted that 
according to the results, it is more likely than not that any of these three 
methods will identify the correct number of layers, no matter which of the 
three models produced the data. 

6. Results and discussion for the Coccolithus data. We ran the model 
selection exercise without imposing the pull identification restrictions (Sec- 
tion 3.3). The model variants selected by BML and AIC as the best violated 
these restrictions, and were found substantially better than models satisfy- 
ing the restrictions. As a consequence, the analysis shown here is without 
pull identification restrictions. Results under these restrictions are shown in 
the supplementary material [Reitan, Schweder and Henderiks (2012)]. 
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Table 2 

Maximum likelihood (ML) estimates, Bayesian posterior median (B. median), 
95% prior and posterior credibility intervals for the model considered 
best according to BML. The units of the characteristic times Ati aye r,Bite are 
specified in the table, where the following units are used: y = year, ky = 10 3 years, 
My = 10 6 years, Gy = 10 9 years. The fixed median of the state , specified in the 
original measurement scale, is given in units of fim = micrometers . The parameter ft is 
the effect of temperature indicator on the second layer process in units of log(size)/C° . 
The diffusion parameters ai ayor are in units of log(size)/My 1//2 





Parameters 


ML 


B. median 


Prior 95% 


Posterior 95% 






7.42 


7.43 


0.001-1000 


7.15-7.69 




P 


0.0006 


0.0005 


-1-1 


-0.002-0.012 


Upper 


Aii 


11 ky 


16 ky 


1 ky-1 Gy 


0.3 ky-80 ky 


layer 


(Ji 


0.48 


0.43 


0.02-3.5 


0.22-2.8 


Middle 


At2,525 


130 ky 


120 ky 


1 ky-1 Gy 


1.0 ky-0.52 My 


layer 


At2,612 


215 My 


40 My 


1 ky-1 Gy 


0.58 My-3.2 Gy 




At2,516 


0.4 ky 


11 ky 


1 ky-1 Gy 


90 y-150 ky 




Ai2,752 


1.0 My 


1.0 My 


1 ky-1 Gy 


250 ky-2.9 My 




At2,806 


3.7 Gy 


88 My 


1 ky-1 Gy 


2.0 My-10 Gy 




At2,982 


0.4 ky 


12 ky 


1 ky-1 Gy 


100 y-140 ky 


Lower 


Ai 3 


1.2 My 


1.4 My 


1 ky-1 Gy 


0.64 My-3.7 My 


layer 


(T3 


0.17 


0.16 


0.02-3.5 


0.11-0.24 




P3 


0.66 


0.64 


-0.18-0.98 


0.29-0.85 



6.1. Multi-layered model selection. Going through all 710 model vari- 
ants, we opted to do the model selection both by looking at BML and AIC 
and by doing Bayesian property analysis. 

To our knowledge, little information is available on which to base the prior 
distribution, except broad ideas about the time scales involved in biological 
processes and general knowledge about the size and variation of size in single- 
celled organisms. We therefore construct our prior by applying independent 
normal distributions to each type of parameter (or rather a reparametrized 
version so that the new parameters can be allowed to take values over the 
entire real line). See the supplementary information [Reitan, Schweder and 
Henderiks (2012)] for more concerning the prior distribution, which was wide 
but informative. 

The best model according to BML (and second best according to AIC), a 
three-layered model, is shown in Table 2. It had a Bayes factor B = 215 com- 
pared to the best one-layered model, while the AIC difference between the 
best multi-layered model according to AIC and the best one-layered model 
according to AIC was A AIC = —11.2. The best model according to AIC 
was a variant of the best model according to BML only with an extra inter- 
regional correlation found in the first layer. Since the top layer turned out to 
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have very fast-moving dynamics, any contributions due to inter-regional cor- 
relations would quickly vanish. Therefore, the structural difference between 
the best AIC and best BML model made very little practical difference. Two 
fairly different statistical model selection methods thus gave approximately 
the same model, having overall the same structure. There was regional pull 
without diffusion in the second layer (deterministic filtering of the lowest 
layer, with different filtering for different sites). Inter-regional correlations 
were inferred to be on the third layer. 

According to Table 2, the second layer contains both sites with greater 
and smaller pulls than that of the third layer, which is a result not possible 
if the pull identification restriction is imposed. Note that while layers can be 
switched in a multiple site setting as well as for single set, there is no way 
to switch them so as to make pulls progressively lower for all sites. 

Compared to our initial model, there are a couple of changes. The inter- 
regional correlation is moved to the lowest layer and is no longer perfect. 
The pull parameters on the second layer are regional rather than global and 
there is no diffusion on that layer. Thus, this model has five more parameters 
than the initial model. 

6.2. Influence of external data series (global climate). After finding the 
best model according to BML, we looked at expansions that allowed for 
an external data series to influence the second layer though a regression 
term, (3. The 95% credibility interval for /3 encompassed zero, indicating cell 
size to be rather unaffected by the global temperature indicator. The Bayes 
factor between the model described and the same model without the tem- 
perature indicator series regression, B = f(D\M no temp) / f (D\My,ith temp), is 
about 662. The Bayes factor for the regression is sensitive to the choice of 
prior for (3, so that a wider 95% credibility interval (—10, 10), results in the 
Bayes factor larger than 6000. Thus, the strength of the evidence against 
temperature dependence is sensitive to the prior, though for both the ini- 
tial and the alternative choice of prior distribution for /3, the Bayes factor 
indicates no evidence for temperature dependence. With a 95% credibility 
interval encompassing zero and having an upper limit of about 0.012 and 
with a variation in the temperature indicator series of about 4.7, the impact, 
if any, of the temperature indicator variations on the second layer would with 
95% probability be less that exp(4.7 x 0.012) « 6%. Thus, the global tem- 
perature indicator influence on the coccolith size will be disregarded for the 
rest of the analysis. 

6.3. Inference of model properties. Inference on the process states they 
themselves can also be performed in the same framework, using the Kalman 
smoother method. An example of this is shown in Figure 2, where the three 
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(a) 

Fig. 2. Process inference and measurements for Site 752 running in the time interval 
— 26 My to —8 My, using the model in Table 2, which is considered best according to 
BML. The inference takes Bayesian parameter uncertainty into account as well as process 
uncertainty given parameter values, (a) Three-layered posterior means. Solid line: upper 
layer; short dashed line: middle layer; long dashed line: lower layer; circles: measurements. 
The upper and middle layer may be hard to separate, except for small time intervals close to 
measurements. Note that there are patterns in the lowest layer before the first measurement 
belonging to this site. This is because the lowest layer has correlations to other sites, for 
which there are measurements before this time, (b) Upper layer posterior mean, uncertainty 
and one realization. Solid line: mean; dashed lines: limits of a 95% credibility interval; 
grey wiggly line: realization; circles: measurements. Except for the right side of the graph, 
where there are many high quality measurements, the credibility interval is wide. Note 
that while the upper layer realization may seem noisy due to the short-memory stochastic 
contributions to that layer, it is also influenced by lower layers with longer memory. This 
creates correlations over large time spans, so that the realization can be over or under the 
top layer mean for large time spans. 

layers belonging to a single site in our data set have been studied using sam- 
pled parameter sets (taken from the posterior MCMC samples) on the model 
considered best. Figure 2(a) indicates that the top layer is well adapted 
to the observations, while the processes of the layers below have expecta- 
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Fig. 2. (Continued). 

tion values that are progressively smoother versions of the layer above. Fig- 
ure 2(b) shows how the uncertainty of the topmost process can be associated 
with a fixed maximum span far outside the observations but becomes pro- 
gressively smaller the closer in time one is to an observation. However, the 
credibility interval does not shrink to zero, because of observational noise. 

Looking at property inference, as described in Section 4.3, we get results 
which are shown in Table 3. The data give little posterior weight to a pro- 
cess having only one layer, while higher weights are given to two or three 
layers. Also, there is an OU process rather than a random walk in the lowest 
layer (thus stationarity) and there is evidence for inter-regional correlations. 
Three layers seem to be slightly preferred over two layers and regional pull 
parameters are found in the middle layer. There seems to be support for the 
existence of at least one deterministic layer. These results support models 
having the same overall structure as the top model described in Table 2. All 
of the top five models under both AIC and BML selection, described in the 
supplementary material [Reitan, Schweder and Henderiks (2012)], also seem 
to have the same properties. 
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Table 3 

Posterior weights for different properties, with the number of models in parenthesis. Note 
that while regional parameters can be found either in the first, second or third layer in 
a three-layered model, such a feature is only possible in the first and second layer in 
a two-layered model and only in the first layer in a one-layered model. Similarly, 
a three-layered model may be allowed to have no diffusion either in the first or second 
layer or both, but a two-layered model may only be without diffusion in the first layer and 
a layer without diffusion is not an option for a one-layered model 



Property Options 



Number of layers: 


1 


2 


3 






7.5% (18) 


32.4% (114) 


60.1% (578) 




Regionality in: 


none or jj,o 


pull 


diffusion 






0.2% (177) 


88.3% (259) 


11.5% (274) 




Regional parameters in: 


no layer 


layer 1 


layer 2 


layer 3 




0.2% (177) 


13.1% (205) 


69.1% (196) 


17.5% (132) 


Inter-regional 


none 


intermediate (6D)perfect (ID) 




instantaneous correlations: 


4.9% (50) 


85.2% (212) 


9.9% (448) 




No diffusion in 


no layer 


layer 1 


layer 2 


both layer 1 and 




7.7% (486) 


1.2% (132) 


91.1% (72) 


0.007% (20) 


Random walk in 


no 


yes 






the lowest layer: 


99.1% (414) 


0.9% (296) 







Bayesian model comparison can be hampered by sensitivity to the prior 
probabilities. We therefore did a new analysis with another prior distribution 
deemed reasonable, namely, one where the characteristic times were given 
95% credibility bands spanning from 1 y to 300 My rather than 1 ky to 
1 Gy and where the distribution of the diffusion parameters were widened. 
The top two models according to BML remained the same and the parame- 
ter estimates remained essentially the same also. The property weights were 
subtly shifted, but not enough to change any previous conclusions. These 
results suggest to us that the Bayesian inference was robust. See the sup- 
plementary material [Reitan, Schweder and Henderiks (2012)] for details 
concerning various sensitivity tests. 

The structure of the inter-regional correlations can be investigated by 
categorizing according to in which layer the inter-regional correlations exist. 
The posterior weights (Table 4) indicate support for inter-regional correla- 
tions in the lowest layer and possibly also on the top layer. According to the 
best model in Table 2, the top layer is found to be very fast moving (having 
a characteristic time less than 10 ky, which is less than the smallest time in- 
tervals in the data). With the rather coarse time resolution in our data, the 
issue of inter-regional correlations in the upper layer must therefore remain 
unresolved. 
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Table 4 

Posterior weights of correlation structure properties. Only three-layered models with 
nondegenerate pull at the bottom layer (333 in total) were examined 



Inter-regional correlation in 



Top layer 


N 


Y 


N 


Y 


N 


Y 


N 


Y 


Intermediate layer 


N 


N 


Y 


Y 


N 


N 


Y 


Y 


Bottom layer 


N 


N 


N 


N 


Y 


Y 


Y 


Y 


Number of models 


15 


16 


30 


32 


56 


60 


60 


64 


Posterior probability (%) 


1.6 


1.1 


4.8 


2.5 


49.9 


25.2 


10.1 


4.9 



Although our study is explorative, we are confident that Coccolithus has 
phenotypic evolution with motion in at least two layers. In addition, the 
posterior distributions for the pull parameter(s) in the second layer indicate 
that the fitness optima are not evolving as random walks, which was also 
the case for the optimum in Hansen, Pienaar and Orzack (2008), though 
there the optimum was modeled to track a random walk in the lower layer, 
while in our study the data suggests a stationary lower layer. 

7. Summary. A modeling framework for systems of related processes 
evolving in continuous time has been constructed. This framework has been 
applied to fossil data of size variability in marine unicellular algae spanning 
nearly 60 million years. A simulation study showed that for the amount of 
data in this application, the number of layers could be inferred. 

There is basic consensus among the models considered best using different 
criteria and in the property analysis as to the following model properties: 

(1) There is more than one layer of stochastic processes at play. (2) There 
is dependency between what happens in two different regions. (3) The are 
some regional differences in the nature of the dynamics, that is, there are 
some parameters that are site-specific. (4) The regional dependency does 
not stem primarily from the top layer, but from something further down 
the causal chain. (5) The hidden process seems to exhibit slow variation 
resulting in long autocorrelation in cell size. 

This suggests that both global and local processes influence the mean 
phenotype, through dynamic site-specific fitness optima that respond to an 
underlying process that is partly global. However, as already mentioned, we 
find no support for a forcing by a global temperature indicator series on 
Coccolithus size, in contrast to reports on other biotic groups [see Schmidt 
et al. (2004), Finkel et al. (2007)]. The regional differences in estimated 
model parameters might reflect contrasts in local environmental conditions, 
differences in morphotype [possibly (sub)species] composition or a combina- 
tion of both. These interpretations and additional sensitivity tests (e.g., how 
age models affect the model outcome) will be further explored in another 
publication [see Henderiks et al. (2012)]. 
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Thus, it seems that the framework can be used for studying and reaching 
tentative conclusions about the driving forces behind a phenotypic time se- 
ries. It should also be possible to use this framework in other settings both 
within and outside paleontology and evolutionary biology. Time series with 
irregular temporal resolution, due to missing data, breaks in the observa- 
tional scheme or for other reasons, are not uncommon. For such data our 
framework provides an alternative to methods based on auto regression with 
regular temporal resolution. 

The class of models described by linear SDEs is wide, but computation- 
ally feasible. They allow causality to be modeled and studied [see Schweder 
(2012)], and they accommodate latent hierarchical structures. Our model 
could, for instance, be expanded by including more layers, allowing for time 
series that are phylogenetically related, allowing for external processes being 
included as exogenous forcings in different layers or by using the framework 
also for modeling these time-series, thus making them an integral part of 
the analysis. Furthermore, geographic information could be incorporated by 
letting the correlation between sites depend on the distance between the 
sites, for example, in relation to ocean circulation. 

Source code. The source codes for our analysis programs can be found 
on the web page http://folk.uio.no/trondr/layered. 
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SUPPLEMENTARY MATERIAL 

Phenotypic evolution studied by layered stochastic differential equations — 
supplementary material (DOI: 10.1214/12-AOAS559SUPP; .pdf). Supple- 
mentary material: Mathematical details, description of the prior distribu- 
tion, Kalman filtering, practical restrictions, numerical methods, data is- 
sues, extra material on simulation studies and model selection results, and 
robustness analysis. 
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